dolly = c(9, 10, 11, 12, 13, 14, 15, 16, 17)
bees = c("b", "b", "b", "b", "b")
Module 3: Vectors and Lists
This lesson is about understanding different data types in R, namely vectors and lists. We will build up to the understanding of what a dataframe really is at a high level. The lab will explore how vector operations are particularly useful in the context of data manipulation and analysis.
Vectors and Lists
Download a copy of Module 3 slides
Download data for Module 3 lab and tutorial
Lab 3
General Guidelines:
You will encounter a few functions we did not cover in the lecture video. This will give you some practice using a new function for the first time. You can try the following steps:
- Start by typing ?new_function in your Console to open up the help page.
- Read the help page of this new_function. The description might be too technical for now. That’s OK. Pay attention to the Usage and Arguments, especially the argument x, or x, y (when two arguments are required).
- At the bottom of the help page, there are a few examples. Run the first few lines to see how it works.
- Apply it in your lab questions.
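As a rough sketch of the steps above, here is what looking up an unfamiliar function might look like. The choice of seq() is only for illustration; any function name works the same way.

```r
# Open the help page for seq(); this is the same as help("seq")
?seq

# Print the argument names and defaults without leaving the console
args(seq)

# Run the examples listed at the bottom of the help page
example(seq)
```

Running example() on a function is a quick way to see typical calls before you adapt one to a lab question.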
It is highly likely that you will encounter error messages while doing this lab. Here are a few steps that might help get you through it.
- Locate which line is causing this error first
- Check if you may have a typo in the code. Sometimes another person can spot a typo faster than you.
- If you enter the code without any typo, try googling the error message
- Scroll through the top few links to see if any of them help
- Try working on the next few questions while waiting for answers by TAs
Warm-up
- In the lecture, we covered c(), :, rep(), and seq(), among other ways to create vectors.
- Recreate dolly using :.
- Create the same vector using seq().
- Recreate bees using rep().
- We are now going to use the functions rnorm() and runif() to initialize vectors.
random_norm = rnorm(100)
random_unif = runif(1000)
- How long are the vectors random_norm and random_unif? Use length() to verify.
- What are the largest and smallest values in random_norm and random_unif? Use min() and max().
- Use mean() and sd() to calculate the mean and standard deviation of the two distributions.
- Create a new vector with 10000 draws from the standard normal distribution.
- rnorm() by default sets mean = 0 (see ?rnorm). Create a vector of 10000 draws from the normal distribution with mean = 1. Use mean() to verify.
Notice the functions min(), max(), mean(), and sd() all take a vector with many values and summarize them as one value. These are good to use with summarize() when doing data analysis on simple dataframes.
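To make the warm-up ideas concrete, here is a small sketch that builds a few vectors and then summarizes one of them. All values, variable names, and the seed are hypothetical and chosen only for illustration; they are not the lab's answers.

```r
# Different ways to build vectors (hypothetical values)
v_colon <- 3:7                       # 3 4 5 6 7
v_seq   <- seq(from = 3, to = 7)     # same values via seq()
v_step  <- seq(0, 1, by = 0.25)      # 0 0.25 0.5 0.75 1
v_rep   <- rep("a", times = 3)       # "a" "a" "a"
all(v_colon == v_seq)                # TRUE

# Random draws with non-default parameters (arbitrary seed for reproducibility)
set.seed(42)
draws <- rnorm(500, mean = 5, sd = 2)

# Each of these reduces the whole vector to a single value
length(draws)   # 500
min(draws)
max(draws)
mean(draws)     # close to 5, but not exactly 5
sd(draws)       # close to 2, but not exactly 2
```

Because the draws are random, the sample mean and standard deviation only approximate the parameters passed to rnorm().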
Data Types
- Use typeof() to verify the data types of dolly, bees, and random_unif.
- Coerce dolly to a character vector. Recall we have functions as.<type>() for this kind of coercion.
- Try to coerce bees to type numeric. What does R do when you ask it to turn “b” into a number?
Vectorized Math
- a and b are vectors of length 10. Look at them in the console.
a <- 1:10
b <- rep(c(2, 4), 5)
Add a and b element by element.
Subtract a and b element by element.
Divide a by b element by element.
Multiply a and b element by element.
Raise each element of a to the power of the corresponding element of b.
Multiply each element of a by 3, then subtract b.
Raise each element of b to the third power.
Take the square root of each element of a.
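As an illustration of what "element by element" means, here is a minimal sketch with two short hypothetical vectors, x and y, rather than the lab's a and b.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)

x + y              # 11 22 33
y / x              # 10 10 10
x ^ 2              # 1 4 9   (the single value 2 is reused for every element)
2 * x - y          # -8 -16 -24
sqrt(c(4, 9, 16))  # 2 3 4
```

Each operator pairs up elements by position, and a single number is recycled across the whole vector.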
Calculating Mean and Standard Deviation
Calculating the Mean
In this exercise, we will calculate the mean of a vector of random numbers. We will practice assigning new variables and using functions in R.
We can run the following code to create a vector of 1000 random numbers. The function set.seed() ensures that the process used to generate random numbers is the same across computers.
Note: rf() is an R function we use to generate 1000 random numbers according to the F distribution, and 10 and 100 are its two degrees-of-freedom parameters, which specify how “peaked” the distribution is.
set.seed(1)
random_numbers = rf(1000, 10, 100)
Write code that gives you the sum of random_numbers and saves it to a new variable called numbers_sum:
Hint: To sum the numbers in a vector, use the sum() function.
Note: R does not print a value when you assign it to a variable. Type numbers_sum into the console and run it to see the value that you assigned.
Write code that gives you the number of items in the random_numbers vector and saves it to a new variable called numbers_count:
Hint: To count the number of items in a vector, use the length() function.
Now write code that uses the above two variables to calculate the average of random_numbers and assign it to a new variable called this_mean.
What number did you get? It should have been 1.018. If it isn’t, double check your code!
R actually has a built-in function to calculate the mean for you, so you don’t have to remember how to build it from scratch each time! Check your above answer by using the mean() function on the random_numbers vector.
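The same sum-then-divide idea can be seen on a tiny hypothetical vector, where the arithmetic is easy to check by hand. The names scores, total, n, and avg are made up for this sketch.

```r
scores <- c(2, 4, 6, 8)

total <- sum(scores)      # 20
n     <- length(scores)   # 4
avg   <- total / n        # 5

# The built-in function agrees with the hand calculation
all.equal(avg, mean(scores))   # TRUE
```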
Calculating the Standard Deviation
Now that you’ve got that under your fingers, let’s move on to standard deviation.
We will be converting the following formula for calculating the sample standard deviation into code:
\(s = \sqrt{\frac{\sum_{i=1}^n (x_i - \overline{x})^2} {n-1}}\)
For this, we’ll review the concept of vectorization. This means that an operation like subtraction will act on all numbers in a vector at the same time.
Subtract this_mean from the random_numbers vector. Did each number in random_numbers change?
Try to write the formula for standard deviation in R code using the sqrt(), sum(), and length() functions, along with other operators (^, /, -). Assign it to a new variable called this_sd. Watch out for your parentheses!
What number did you get for this_sd, or the standard deviation of random_numbers? If you didn’t get 0.489704, recheck your code!
R also has a built-in function for standard deviation. Check if you calculated the standard deviation correctly by using the sd() function on the random_numbers vector.
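If the parentheses get confusing, it can help to build the formula in small steps first. This sketch does so on a short hypothetical vector x (not random_numbers), with intermediate names invented for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

deviations <- x - mean(x)          # vectorized: the mean (5) is subtracted from every element
sum_sq     <- sum(deviations^2)    # 32
manual_sd  <- sqrt(sum_sq / (length(x) - 1))

# Compare against the built-in sample standard deviation
all.equal(manual_sd, sd(x))        # TRUE
```

Once each step gives the value you expect, the pieces can be combined into a single expression.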
Making a Histogram of Our Numbers
What do these random numbers look like, anyway? We can use base plotting in R to visualize the distribution of our random numbers.
Run the following code to visualize the original distribution of random_numbers as a histogram.
hist(random_numbers)
Notice how most of the values are concentrated on the left-hand side of the graph, while there is a longer “tail” to the right? Counterintuitively, this is known as a right-skewed distribution. When we see a distribution like this, one common thing to do is to normalize it.
This is also known as calculating a z-score, which we will cover next.
Calculating a Z-Score
The formula for calculating a z-score for a single value, or normalizing that value, is as follows:
\(z = \frac{x - \bar{x}}{s}\)
This can be calculated for each value in random_numbers in the context of the larger set of values.
Can you translate this formula into code?
Using random_numbers, this_mean, and this_sd that are already in your environment, write a formula to transform all the values in random_numbers into z-scores, and assign it to the new variable normalized_data.
Hint: R is vectorized, so you can subtract the mean from each random number in random_numbers in a straightforward way.
Take the mean of normalized_data and assign it to a variable called normalized_mean.
Note: If you see something that ends in “e-16”, that means that it’s a very small decimal number (16 places to the right of the decimal point), and is essentially 0.
Take the standard deviation of normalized_data and assign it to a variable called normalized_sd.
What is the value of normalized_mean? What is the value of normalized_sd? You should get a vector with a mean of zero and a standard deviation of one, because the data has been normalized.
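Here is a minimal sketch of the same transformation on a short hypothetical vector x, so the result is easy to inspect. The name z is made up for this example.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Subtract the mean from every element, then divide every element by the sd
z <- (x - mean(x)) / sd(x)

mean(z)   # 0, possibly up to tiny floating-point error
sd(z)     # 1, possibly up to tiny floating-point error
```

Note that the z-scores keep the same ordering and relative spacing as the original values; only the center and scale change.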
Making a Histogram of Z-scores
Let’s plot the z-scores and see if our values are still skewed. How does this compare to the histogram of random_numbers? Run the following code:
hist(normalized_data)
Is the resulting data skewed?
Calculating a T-Score
A two-sample t-test is used to assess whether the means of two groups differ by more than sampling variation alone would explain. The formula for calculating a t-score is as follows:
\(t = \frac{\overline{x}_1 - \overline{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\)
where \(\overline{x}_i\) is the mean of the first or second set of data, \(s_i\) is the sample standard deviation of the first or second set of data, and \(n_i\) is the sample size of the \(i\)th set of data.
We’ll first create two data sets of random numbers following a normal distribution:
set.seed(1)
data_1 <- rnorm(1000, 3)
data_2 <- rnorm(100, 2)
Here’s how we’ll calculate the mean (x_1), standard deviation (s_1), and sample size (n_1) of the first data set:
x_1 <- mean(data_1)
s_1 <- sd(data_1)
n_1 <- length(data_1)
What numeric types do you get from doing this? Try running the typeof() function on each of x_1, s_1, and n_1. We have started you off with x_1.
typeof(x_1)
[1] "double"
What object type is n_1?
Can you calculate the same values for data_2, assigning mean, standard deviation, and length to the variables x_2, s_2, and n_2, respectively?
What values do you get for x_2 and s_2?
Now, you should be able to translate the t-score formula (\(\frac{\overline{x}_1 - \overline{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\)) into code, based on the above calculated values.
What did you get for the t-score? You should have gotten 9.243. If not, double check your code!
The t-score’s meaning depends on your sample size, but in general t-scores close to 0 imply that the means are not statistically distinguishable, and t-scores that are large in absolute value (e.g. |t| > 3) imply the data have different means.
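To see how the pieces of the formula fit together, here is a sketch on two tiny hypothetical samples, g1 and g2, where every intermediate value can be checked by hand. The names se and t_score are invented for this example.

```r
g1 <- c(1, 2, 3, 4, 5)     # mean 3, variance 2.5
g2 <- c(2, 4, 6, 8, 10)    # mean 6, variance 10

# Denominator: sqrt(2.5 / 5 + 10 / 5) = sqrt(2.5)
se <- sqrt(sd(g1)^2 / length(g1) + sd(g2)^2 / length(g2))

# Numerator: difference in sample means, 3 - 6 = -3
t_score <- (mean(g1) - mean(g2)) / se
t_score   # about -1.897
```

The sign only tells you which sample has the larger mean; the size of the t-score is what matters for judging the difference.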
Performing a T-Test
Once again, R has a built-in function that will perform a t-test for us, aptly named t.test(). Look up the arguments the function t.test() takes, and perform a t-test on data_1 and data_2.
What are the sample means, and are they distinguishable from each other?
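As a rough guide to reading the output, this sketch runs t.test() on two tiny hypothetical samples (g1 and g2, not the lab's data) and pulls out individual pieces of the result.

```r
g1 <- c(1, 2, 3, 4, 5)
g2 <- c(2, 4, 6, 8, 10)

# With two vectors and default arguments, t.test() runs Welch's two-sample test,
# which matches the unequal-variance t-score formula used in this lab
result <- t.test(g1, g2)

result$estimate    # the two sample means: 3 and 6
result$statistic   # the t-score, about -1.897
result$p.value     # small values suggest the means are distinguishable
```

Printing result on its own shows all of these together, along with a confidence interval for the difference in means.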
Well done! You’ve learned how to work with R to calculate basic statistics. We’ve had you generate a few by hand, but be sure to use the built-in functions in R in the future.